Multiple Linear Regression Models

STA 101L - Summer I 2022

Raphael Morsomme

Welcome

Announcements

  • assignments

Recap

  • simple linear regression model \[ \text{hwy} \approx \beta_0 + \beta_1 \text{cty} \]
  • residuals
  • least-square estimates
  • parameter interpretation
  • model comparison with \(R^2\)
  • outliers

Outline

  • multiple linear regression
  • feature engineering
  • model comparison
  • predictive performance

Multiple linear regression

Remember, to improve our initial model with \((\beta_0, \beta_1) = (1, 1.3)\), we could

  1. find better estimates,
  2. use additional predictors

For (i), the least-square estimates are usually very good (previous lecture); for (ii), we use a multiple linear regression model.

For instance, to predict hwy we could include multiple variables in our model.

The mpg data set

d <- ggplot2::mpg
head(d, n = 4)
# A tibble: 4 x 11
  manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
  <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa~
2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa~
3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa~
4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa~

Instead of fitting a model with only cty or only displ, we could fit a linear regression model with both predictors!

Linear regression with 2 predictors

The model equation is

\[ \text{hwy} \approx \beta_0 + \beta_1 \text{cty} + \beta_2 \text{displ} \]

We can find the least-square estimates (minimizing the SSR) with the command lm

lm(hwy ~ cty + displ, data = d)

Call:
lm(formula = hwy ~ cty + displ, data = d)

Coefficients:
(Intercept)          cty        displ  
    1.15145      1.32914     -0.03432  

which give the following regression equation

\[ \text{hwy} \approx 1.15 + 1.33 \text{cty} - 0.034 \text{displ} \]

Interpretation

  • Interpreting \(\hat{\beta}_1 = 1.33\):
    • “Keeping displ constant, for each additional unit in cty, we would expect hwy to be higher, on average, by 1.33 units.”
  • Interpreting \(\hat{\beta}_2 = -0.034\):
    • “Keeping cty constant, for each additional unit in displ, we would expect hwy to be lower, on average, by 0.034 unit.”
  • Interpreting \(\hat{\beta}_0 = 1.15\):
    • “For a car with cty and displ equal to 0, we would expect hwy to be 1.15.”
    • meaningless in this context.

Special case: categorical predictor

In a regression model, categorical predictors are represented using indicator variables.

To represent a categorical predictor with \(k\) levels (categories), we use \((k-1)\) indicator variables.

(advanced topic: identifiability)

Including drv

For instance, the categorical variable drv has \(k=3\) levels (4, f and r), so we can represent it with \(3-1=2\) indicator variables with the following model equation

\[ \text{hwy} \approx \beta_0 + \beta_1 \text{drv_f} + \beta_2 \text{drv_r} \]

Tip

For a binary variable \(k=2\), so we only need \(k-1=1\) indicator variable.

For instance, to include the binary variable year_binary, we only added a single indicator variable to the model.

Using categorical predictors in R

lm(hwy ~ drv, data = d)

Call:
lm(formula = hwy ~ drv, data = d)

Coefficients:
(Intercept)         drvf         drvr  
     19.175        8.986        1.825  

The model equation with the least-square estimates is \[ \text{hwy} \approx 19.175 + 8.986 \text{drv_f} + 1.825 \text{drv_r} \]

Interpreting the output

\[ \text{hwy} \approx 19.175 + 8.986 \text{drv_f} + 1.825 \text{drv_r} \]

If a new vehicle has a drv that is:

  • 4, then the two indicator variables drv_f and drv_r take the value 0, and the prediction is \(\widehat{hwy} = 19.175\)
  • f, then the indicator variables drv_f takes the value 1 and drv_r the value 0, and the prediction is \(\widehat{hwy} = 19.175 + 8.986 = 28.161\)
  • r, then the indicator variables drv_f takes the value 0 and drv_r the value 1, and the prediction is \(\widehat{hwy} = 19.175 + 1.825 = 21\)

The level 4 is called the reference level.

Group exercise - categorical predictor

  • Exercises 8.5, 8.7

  • In addition

    1. identify the baseline level of the categorical variable in each model.
    library(openintro)
    d <- openintro::births14
    1. fit the first models in R
05:00

Fitting a larger model

Let us fit a model with cty, drv and disp

m_larger <- lm(hwy ~ cty + drv + displ, data = d)
m_larger

Call:
lm(formula = hwy ~ cty + drv + displ, data = d)

Coefficients:
(Intercept)          cty         drvf         drvr        displ  
      3.424        1.157        2.158        2.360       -0.208  

Its \(R^2\) is 0.938.

glance(m_larger)$r.squared
[1] 0.9384357

Unsurprisingly, including additional predictors makes the regression line closer to the points \(\Rightarrow\) residuals are smaller \(\Rightarrow\) SSR is smaller \(\Rightarrow\) \(R^2\) is larger.

Group exercise - computing \(R^2\) by hand

  • What is the formula for \(R^2\)?
  • Compute \(R^2\) by hand in R (do not use glance). You can use the following code.
d <- ggplot2::mpg
m_larger <- lm(hwy ~ cty + drv + displ, data = d)
m_augment <- augment(m_larger)
04:00

Fitting the full1 model

m_full <- lm(hwy ~ manufacturer + displ + year + cyl + trans + drv + cty + fl + class, data = d)

Thanks to the additional predictors, the residuals are very small, making \(R^2\) close to \(1\).

glance(m_full)$r.squared
[1] 0.9773386

We will see in the next lecture that this is not always a good sign.

Group exercise - multiple linear regression

Exercise 8.9

  1. fit the model in R
  2. identify the type of each variable
  3. identify the baseline level of the categorical predictors
  4. do parts a-d
library(openintro)
d_birth <- openintro::births14
05:00

Statistics as an art – feature engineering

Feature engineering

We saw that adding predictors to the model seems to help.

However, the variables included in the data set, e.g. displ, year, etc, may not be the most useful predictors for hwy.

Feature engineering refers to the creation of new predictors from existing ones.

Tip

This is where your understanding of the data, scientific knowledge, etc, makes a big difference.

Transforming a variable

Consider the predictor displ

ggplot(d) +
  geom_point(aes(displ, hwy))

The relation bwteen displ and hwy is not exactly linear.

Let us include the predictor \(\dfrac{1}{\text{displ}}\) to capture this nonlinear relation.

Here is the model equation

\[ \text{hwy} \approx \beta_0+ \beta_1 \text{displ} + \beta_2 \dfrac{1}{\text{displ}} \]

The least-square coefficient estimates are

d_transf <- mutate(d, displ_inv = 1/displ)
m <- lm(hwy ~ displ + displ_inv, data = d_transf)
m

Call:
lm(formula = hwy ~ displ + displ_inv, data = d_transf)

Coefficients:
(Intercept)        displ    displ_inv  
    12.2601      -0.2332      36.1456  

And the regression lines captures the nonlinear relation.

augment(m) %>%
  ggplot(aes(displ, hwy)) +
  geom_point() +
  geom_line(aes(y = .fitted))

The trees data set

  • Measurements of the diameter, height and volume of timber in 31 felled black cherry trees.
  • Note that the diameter (in inches) is erroneously labelled Girth in the data
  • The diameter (Girth) is measured at 4 ft 6 in above the ground.
  • Source: Atkinson, A. C. (1985) Plots, Transformations and Regression. Oxford University Press.

Combining variables

Transforming a variable may be helpful …but we can go a step further!

We can construct new predictors by combining existing variables.

Consider the trees dataset

d_tree <- datasets::trees
head(d_tree, n = 4)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4

where Girth indicates the tree diameter (twice the radius) in inches.

Suppose we want to estimate Volume (expensive to measure) from Girth and Height (cheap to measure).

You might decide to approximate the shape of a tree with a shape that is between a cylinder and a cone.

From geometry, we know that the volume of a cylinder is

\[ V = \pi r^2 h \]

and that of a cone is

\[ V = \frac{1}{3} \pi r^2 h \]

This suggests approximating the volume of a tree with the following model

\[ \text{Volume} = \beta_1 \left[ \left(\dfrac{\text{Girth}}{2}\right)^2 * \text{Height}\right] = \beta_1X \]

where

  • \(\beta_1\) is an unknown parameter that we expect to be between \(\pi\) (pure cylinder) and \(\frac{1}{3}\pi\) (pure cone)
  • \(X = \left(\dfrac{\text{Girth}}{2}\right)^2 * \text{Height}\) is our predictor, which we construct from existing variables.

To accomplish this, we simply need to create a new variable corresponding to \(\left(\dfrac{\text{Girth}}{2}\right)^2 * \text{Height}\).

But before doing that, we have to transform Girth into feet to ensure that all variables have the same units.

d_tree_comb <- d_tree %>%
  mutate(
    Girth_ft    = Girth / 12,
    radius      = Girth_ft / 2,
    r_squared   = radius^2,
    r2h = r_squared * Height
  )
head(d_tree_comb, n = 3)
  Girth Height Volume  Girth_ft    radius r_squared      r2h
1   8.3     70   10.3 0.6916667 0.3458333 0.1196007 8.372049
2   8.6     65   10.3 0.7166667 0.3583333 0.1284028 8.346181
3   8.8     63   10.2 0.7333333 0.3666667 0.1344444 8.470000

My model does not include an intercept. To exclude the intercept from the model, I use -1 in the lm command.

m <- lm(Volume ~ r2h - 1, data = d_tree_comb)
m

Call:
lm(formula = Volume ~ r2h - 1, data = d_tree_comb)

Coefficients:
  r2h  
1.214  

This is between the two anticipated bounds \(\pi=3.14\) and \(\frac{\pi}{3}=1.047\).

Warning

When we do not include the intercept in a model, \(R^2\) measures something different and should therefore not be interpreted in the usual way.

glance(m)$r.squared
[1] 0.9950219

Comparison with the full model

Although \(R^2\) has a different meaning when there is no intercept, it can be still used for comparison1.

m_full <- lm(Volume ~ Girth + Height -1, data = d_tree)
glance(m_full)$r.squared
[1] 0.9696913

Our new variable \(\left(\dfrac{\text{Girth}}{2}\right)^2 * \text{Height}\) improves the model!

Note that the full model has two predictors, while our geometry-based model only has a single predictor!

Tip

A carefully constructed predictor can do a better job than multiple raw predictors!

Special case of data combination: interaction

Suppose you are interested in predicting the average run time (duration) of amateur jogging races.

Two variables that impact the duration are (i) the distance of the race, and (ii) the weather.

For simplicity, we measure the weather as either good (nice weather) or bad (rain, heat wave, etc).

Our model is therefore

\[ \text{duration} \approx \beta_0 + \beta_1 \text{distance} + \beta_2 \text{weather_bad} \]

where \(\beta_1\) indicate the effect of an additional miles on the expected duration and \(\beta_2\) the effect of bad weather.

Note that the effect of weather is fixed in this model, say “\(+5\) minutes” if \(\hat{\beta}_2 = 5\).

Is this reasonable? No!

\(\Rightarrow\) the effect of weather should vary with distance. For shorter races, bad weather may add only 2 or 3 minutes, while for longer races, bad weather may increase the average duration by 10 or 15 minutes.

Model equation with interaction

We capture such pattern using an interaction term.

\[ \text{duration} \approx \beta_0 + \beta_1 \text{distance} + \beta_2 \text{weather_bad} + \beta_3 \text{weather_bad}*\text{distance} \]

  • When the weather is good, the equation simplifies to

\[ \text{duration} \approx \beta_0 + \beta_1 \text{distance} + \beta_2 0 + \beta_3 0*\text{distance} = \beta_0 + \beta_1 \text{distance} \]

  • When the weather is bad, the equation simplifies to

\[ \text{duration} \approx \beta_0 + \beta_1 \text{distance} + \beta_2 1 + \beta_3 1*\text{distance} = (\beta_0 + \beta_2) + (\beta_1+\beta_3) \text{distance} \]

Interpreting interactions

  • When the weather is good, the slope estimate is \(\hat{\beta}_1\), meaning that the effect of an additional miles on the average duration is \(\hat{\beta}_1\).

  • When the weather is bad, the slope estimate is \(\hat{\beta}_1+\hat{\beta}_3\), meaning that the effect of an additional miles on the average duration is \(\hat{\beta}_1+\hat{\beta}_3\) (not \(\hat{\beta}_1\)).

Interactions

The effect of the distance depends on the weather. Similarly, the effect of the weather depends on the distance.

\(\Rightarrow\) the two variables interact.

Recap

Recap

  • simple linear regression model \[ Y \approx \beta_0 + \beta_1 X \]

  • multiple linear regression model \[ Y \approx \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + + \beta_p X_p \]

  • categorical predictor

    • \((k-1)\) indicator variables
  • feature engineering

    • transforming variables
    • combining variables